Multicore & big.LITTLE

The case for multiprocessing

■ Platform trends
  ■ Clear rise in quad+ cores from mid-range to high-end
  ■ Everything’s getting bigger - LTE, GPU, camera, display
  ■ Single thread performance improvements diminishing - focus on multi-core
  ■ It is not just about performance - thermally constrained use cases are now commonplace

■ Software trends
  ■ OS vendors taking more advantage of multicore
  ■ Wider awareness of multiprocessing support libraries
  ■ Increased combined use of devices - e.g. augmented reality
Multiprocessing

Taking advantage of parallelism

■ In the core: NEON/SIMD
■ Use of common parallelizing tools: OpenMP, RenderScript, OpenCL, etc.
■ Multi-threading where possible: never easy, but increasingly necessary
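
As a small illustration of the "common parallelizing tools" route, here is a sketch of an OpenMP loop in C. The function name `sum_squares` is made up for this example. The pragma only takes effect when the compiler is run with OpenMP enabled (for example `-fopenmp` with GCC); otherwise it is ignored and the loop simply runs on one core.

```c
#include <stddef.h>

/* Hypothetical helper: returns the sum of x[i] * x[i] over n elements.
 * With OpenMP enabled, the iterations are divided among threads and the
 * per-thread partial sums are combined by the reduction clause. */
double sum_squares(const float *x, size_t n)
{
    double sum = 0.0;
    long i;

    #pragma omp parallel for reduction(+:sum)
    for (i = 0; i < (long)n; ++i) {
        sum += (double)x[i] * (double)x[i];
    }
    return sum;
}
```

The same source builds with or without OpenMP, which makes it easy to compare single-core and multi-core behaviour on a given device.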
Looking Ahead - Multi-core Trends for 2014-5

■ Cortex-A15/Cortex-A7 big.LITTLE is the premium product in 2014
■ Range of core count: 4 (2+2), 6 (2+4) and 8 (4+4) cores
■ Cortex-A17/Cortex-A7 (32b) coming in 2015
■ ARMv8-A (64b) chipsets emerging across all segments in 2014
■ Quad and Octa-core Cortex-A53 coming into entry level and mid-range
■ High-end mobile expected to move to A57 and A53 big.LITTLE for 2015
■ Multiple big.LITTLE topologies expected
Relative performance

[Chart: relative performance of Cortex-A7, Cortex-A9 and Cortex-A15]

■ New LITTLE processors offer similar performance to Cortex-A9
■ Significant performance boost with big processor, e.g. Cortex-A15
A big.LITTLE System

Programmer’s view of hardware

■ High performance Cortex-A57 CPU cluster
■ Energy efficient Cortex-A53 CPU cluster
■ CCI-400 maintains cache coherency between clusters
■ GIC-400 provides transparent virtualized interrupt control

[Diagram label: CCI-400 Cache Coherent Interconnect]
big.LITTLE

The evidence from a 4+4 MP system vs Quad Cortex-A15

[Chart: CPU power saving and SoC power saving, shown as improvement over A15 alone; the values 50% and 90% appear on the chart]

■ > 4 threads: MP capacity advantage
■ <= 4 threads: no degradation in MP
big.LITTLE Development

General advice for Global Task Scheduling (GTS)

■ Trust the scheduler...
  ■ Linux will schedule for performance and efficiency
  ■ All new tasks started on big to avoid latency
  ■ Quickly adapts to a task’s needs

■ ...Unless
  ■ You know a thread is intensive but not urgent
  ■ Affine to LITTLE, never to big
  ■ e.g. Maybe use this for asset loading on a separate thread

■ LITTLE cores are great
  ■ You’ll be using them a lot
  ■ Cortex-A53 ~20% greater perf than Cortex-A9
  ■ Most workloads will run on LITTLE
  ■ More thermal headroom for other SoC components

■ big cores are serious powerhouses
  ■ Think of them as short-burst accelerators - e.g. physics-based special effects
  ■ Think about the trade-offs during design
big.LITTLE Development

Things to avoid

■ Imbalanced threads sharing common data
  ■ Cluster coherency is excellent but not free

■ If you have real-time threads note that...
  ■ RT threads are not auto migrated
  ■ RT threads are a design decision, think carefully about affinity: http://linux.die.net/man/2/sched_setaffinity
  ■ And TBB too: https://www.threadingbuildingblocks.org/

■ Avoid long running tasks on big cores
  ■ You’ll rarely need that processing power for long periods
  ■ Can the task be parallelized?
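
The affinity advice can be sketched with the Linux `sched_setaffinity` call referenced on this slide. The helper names `build_cpu_mask` and `pin_calling_thread` are invented for this example, and which CPU numbers belong to the LITTLE cluster is an assumption the caller has to supply, since it differs between SoCs.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <stddef.h>

/* Hypothetical helper: build a CPU mask from a list of CPU indices.
 * The indices of the LITTLE cores are platform-specific and must be
 * provided by the caller. */
void build_cpu_mask(cpu_set_t *mask, const int *cpus, size_t n)
{
    size_t i;

    CPU_ZERO(mask);
    for (i = 0; i < n; ++i) {
        CPU_SET(cpus[i], mask);
    }
}

/* Hypothetical helper: restrict the calling thread to the given CPUs.
 * Returns 0 on success and -1 on failure (errno is set by the call). */
int pin_calling_thread(const int *cpus, size_t n)
{
    cpu_set_t mask;

    build_cpu_mask(&mask, cpus, n);
    /* A pid of 0 means "the calling thread". */
    return sched_setaffinity(0, sizeof(mask), &mask);
}
```

A background asset-loading thread might call `pin_calling_thread` once at start-up with the LITTLE core indices for the device, leaving every other thread to the scheduler.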
Takeaways

■ big.LITTLE & Global Task Scheduling (or HMP) in 2014 devices
  ■ Fantastic peak performance
  ■ Energy-efficient, sustainable compute for long running workloads

■ Multi-processing
  ■ Get ahead of the limits on single thread performance
  ■ Avoid thermal constraints on performance
Single Instruction Multiple Data

Background image by Jan Mehlich - licenced under CC-SA
NEON is a wide SIMD data processing architecture

■ Extension of the ARM® instruction set
  ■ 32 registers, 64 bits wide (dual view as 16 registers, 128 bits wide in ARMv7)

■ NEON instructions perform “Packed SIMD” processing
  ■ Registers are considered as vectors of elements of the same data type
  ■ Data types: signed/unsigned 8-bit, 16-bit, 32-bit and 64-bit integer; single/double precision floating point

■ Instructions perform the same operation in all lanes
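
To make "the same operation in all lanes" concrete, here is a minimal sketch using NEON C intrinsics from `arm_neon.h`. The function name `add_f32` is invented for this example. The NEON path is compiled only when the compiler defines `__ARM_NEON`; on other targets the scalar loop does all the work.

```c
#include <stddef.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Hypothetical helper: dst[i] = a[i] + b[i] for n floats. */
void add_f32(float *dst, const float *a, const float *b, size_t n)
{
    size_t i = 0;

#if defined(__ARM_NEON)
    /* Each 128-bit register holds four 32-bit float lanes; one vaddq_f32
     * adds all four lanes at once. */
    for (; i + 4 <= n; i += 4) {
        float32x4_t va = vld1q_f32(a + i);
        float32x4_t vb = vld1q_f32(b + i);
        vst1q_f32(dst + i, vaddq_f32(va, vb));
    }
#endif
    /* Scalar tail, and the whole loop on targets without NEON. */
    for (; i < n; ++i) {
        dst[i] = a[i] + b[i];
    }
}
```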






General purpose SIMD processing useful for many applications

■ Supports widest range of multimedia codecs used for internet applications
  ■ Many soft codec standards: MPEG-4, H.264, On2 VP6/7/8/9, Real, AVS, ...
  ■ Supports all internet and digital home standards in software

■ Fewer cycles needed
  ■ NEON will give 1.6x-2.5x performance on complex video codecs
  ■ Individual simple DSP algorithms can show larger performance boost (4x-8x)
  ■ Processor can sleep sooner => overall dynamic power saving

■ Straightforward to program
  ■ Clean orthogonal vector architecture
  ■ Applicable to a wide range of data intensive computation
  ■ Not just for codecs - applicable to 2D/3D graphics and other processing
  ■ Off-the-shelf tools, OS, commercial & open source ecosystem support
Specific media intensive test case using Android NDK

NEON Visualizer



http://szeged.github.com/nevada/ 


ARM 
Don’t Reinvent the Wheel! NEON in Open Source Today

■ Google WebM - 11,000 lines NEON assembler!
■ BlueZ - official Linux Bluetooth protocol stack
■ Pixman (part of cairo 2D graphics library)
■ ffmpeg (libav) - libavcodec
  ■ LGPL media player used in many Linux distros and products
  ■ Extensive NEON optimizations
■ x264 - Google Summer Of Code 2009
  ■ GPL H.264 encoder - e.g. for video conferencing
■ Android - NEON optimizations
  ■ Skia library, S32A_D565_Opaque 5x faster using NEON
  ■ Available in Google Skia tree from 03-Aug-2009
■ LLVM - code generation backend used by Android RenderScript
■ Eigen2 - C++ vector math / linear algebra template library
■ TheorARM - libtheora NEON version (optimized by Google)
■ libjpeg / libjpeg-turbo - optimized JPEG decode
■ libpng - optimized PNG decode
■ FFTW - NEON enabled FFT library
■ Liboil / liborc - runtime compiler for SIMD processing
■ webkit/blink - used by Chrome Browser
Optimization Paths for NEON

■ Open source libraries, e.g. OpenMAX, libav, libjpeg, Android Skia, etc.
  ■ Freely available open source optimizations

■ Vectorizing compilers
  ■ Exploit NEON SIMD automatically with existing source code
  ■ Status: Released (in DS-5 armcc, CodeSourcery, Linaro gcc and now LLVM)

■ NEON C intrinsics
  ■ C function call interface to NEON operations
  ■ Supports all data types and operations supported by NEON
  ■ Status: Released (in DS-5 and gcc), LLVM/Clang under development

■ Assembler
  ■ For those who really want to optimize at the lowest level
  ■ Status: Released (in DS-5 and gcc/gas)

■ Commercial vendors
  ■ Optimized and supported off-the-shelf packages
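
For the vectorizing-compiler path, the source stays plain C. The sketch below shows a loop written so that an auto-vectorizer has an easy job. The function name `saxpy` and the choice of `restrict` are illustrative assumptions. Whether the compiler actually emits NEON instructions depends on the toolchain and optimization flags, so it is worth checking the generated code.

```c
#include <stddef.h>

/* Hypothetical helper: y[i] += a * x[i] for n floats.
 * The restrict qualifiers promise the compiler that x and y do not
 * overlap, which removes one common obstacle to auto-vectorization. */
void saxpy(float a, const float *restrict x, float *restrict y, size_t n)
{
    size_t i;

    /* Simple counted loop, unit stride, no function calls in the body. */
    for (i = 0; i < n; ++i) {
        y[i] += a * x[i];
    }
}
```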




Introducing Ne10

■ Ne10 is designed to provide a set of common, useful functions which
  ■ have been optimised for ARMv7 and NEON, with many v8 functions available in intrinsic C
  ■ provide consistent, well-tested behaviour
  ■ can be easily incorporated into applications

■ It is targeted at Android and Linux to maximize app performance, and tested under iOS

■ Features
  ■ Usable from C/C++ and Java/JNI
  ■ The library is modular; functionality that is not required within an app can be discarded
  ■ Functions similar to the Accelerate Framework provided by iOS

■ It is free
  ■ No commercial complications - ‘build and ship’
  ■ BSD License
  ■ Well-tested behavior with example code

■ Use of the Ne10 library should be a joy, not a chore
  ■ Out-of-box and user experience is critical to success
  ■ Build and go, accessible documentation, clear code
  ■ Supported by ARM, community contributions welcome


Ne10Droid - The App in action

■ Ne10Droid is a benchmarking Android app that uses Ne10.
■ Routines are written using VFP in C, VFP in assembly, and NEON.
■ Example routines:

arm_result_t normalize_vec2f(arm_vec2f_t * dst, arm_vec2f_t * src, unsigned int count);

arm_result_t normalize_vec3f(arm_vec3f_t * dst, arm_vec3f_t * src, unsigned int count);

arm_result_t normalize_vec4f(arm_vec4f_t * dst, arm_vec4f_t * src, unsigned int count);
A Little Taxonomy

■ ARMv{Version/Extension/Class} - Generic Architecture Name
■ ARMv8-A - ARM architecture version 8, application class
■ AArch64 - 64-bit execution state
  ■ A64 - ARM instruction set
  ■ LP64 - 64-bit data model
  ■ ILP32 - 32-bit data model
■ AArch32 - 32-bit execution state
  ■ A32 - ARM instruction set
  ■ T32 - Thumb instruction set
  ■ ILP32 - 32-bit data model
■ Interprocessing - Interaction of execution environments
A Little Perspective

[Diagram: ARM architecture overview leading to ARMv8-A; labels recovered from the slide are listed below]

■ VFPv3/v4
■ NEON™ Adv SIMD
■ CRYPTO
■ ARMv8-A
  ■ AArch32 - key feature: ARMv7-A compatibility
    ■ A32+T32 ISAs, including scalar FP (SP and DP) and Adv SIMD (SP float)
  ■ AArch64
    ■ A64 ISA, including scalar FP (SP and DP) and Adv SIMD (SP+DP float)

A Closer Look

■ ARMv8-A is one of the most significant architecture changes in ARM’s history

■ AArch64 can access ALL ARMv8-A features
  ■ Larger address space (>4GB memory for applications)
  ■ Wider data registers (64-bit)
  ■ Better SIMD (NEON)
  ■ New crypto instructions
  ■ More data registers (31 general, 32 FP/SIMD/Crypto)
  ■ More...

■ In the long term, delivers a unified architecture across Mobile, Client (Tablet/Desktop) and Enterprise markets
ARMv7-A

■ LargePhysAddrExtn
■ Virtualization Extn
■ TrustZone
■ ARM+Thumb ISAs
■ NEON
■ Hard Float

[Diagram rows; where a row has two entries they are separated by "|" as on the slide]

■ 32-bit VA; <=40-bit PA | >32-bit VA; <=48-bit PA
■ 4KB pages | {4, 16, 64}KB pages
■ EL3, EL2, EL1 and EL0 exception hierarchy
■ LD acquire/ST release: C1x/C++11 compliance
■ IEEE 754-2008 compliant floating point
■ AdvSIMD (SP float) | AdvSIMD (SP+DP float)
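
The "LD acquire/ST release" row refers to the memory ordering model used by C11 and C++11 atomics. A minimal sketch of that pattern in C11 follows. The names `payload`, `ready`, `publish` and `try_consume` are invented for this example. On AArch64, compilers typically implement these operations with the load-acquire and store-release instructions, though the exact code generated is up to the toolchain.

```c
#include <stdatomic.h>
#include <stdbool.h>

static int payload;                 /* ordinary, non-atomic data */
static atomic_bool ready = false;   /* publication flag */

/* Producer side: write the data, then publish it with a store-release.
 * The release ordering keeps the payload write from moving after the flag. */
void publish(int value)
{
    payload = value;
    atomic_store_explicit(&ready, true, memory_order_release);
}

/* Consumer side: a load-acquire that observes the flag as true also
 * observes the payload written before the matching store-release. */
bool try_consume(int *out)
{
    if (!atomic_load_explicit(&ready, memory_order_acquire)) {
        return false;
    }
    *out = payload;
    return true;
}
```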
Exception Levels & Interprocessing

[Diagram: ARMv8-A exception levels, with each level marked AArch64 or AArch32, across configurations ranging from native 64-bit support to 32-bit compatibility; labels recovered from the slide are listed below]

■ Native 64-bit support: Full AArch64 support
■ AArch64: Monitor + Hypervisor + OS support
■ AArch64: Monitor + Hypervisor support
■ AArch64: Monitor only
■ AArch32: Hypervisor + OS + Application support
■ 32-bit compatibility: Full AArch32 support
So, You Have a 32-bit Application...

■ Moving to ARMv8-A
  ■ No swap instruction, CP15 barriers, load/store multiple
  ■ Load/store pair

■ Moving to A64
  ■ Mnemonically similar to A32
  ■ More and larger registers
  ■ No instruction predication
  ■ Conditional select
  ■ Dedicated return instructions

■ Why move?
  ■ Significant performance gains come with ARMv8-A
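
For C programmers, "no instruction predication" and "conditional select" mostly show up in how the compiler handles short conditionals. The sketch below uses conditional expressions, which compilers targeting A64 commonly lower to conditional-select instructions rather than branches. That outcome is not guaranteed, and the function name `clamp_i32` is invented for this example.

```c
#include <stdint.h>

/* Hypothetical helper: clamp v into the range [lo, hi].
 * Written with conditional expressions so each step is a simple
 * "pick one of two values" decision. */
int32_t clamp_i32(int32_t v, int32_t lo, int32_t hi)
{
    int32_t t = (v < lo) ? lo : v;   /* raise to the lower bound if needed */
    return (t > hi) ? hi : t;        /* lower to the upper bound if needed */
}
```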
But You Do Not Use Assembly Language

■ Beware of
  ■ Object sizes in LP64
  ■ Casting between pointer and non-pointer types
  ■ Implicit type/size conversions
  ■ Bit-wise manipulations
  ■ Magic numbers

■ Multiarch can be your friend

■ Trust your compiler!
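
A few of the "beware of" items can be shown in portable C. This is a sketch with invented helper names (`ptr_to_int`, `int_to_ptr`, `set_bit64`, `bytes_for_longs`), intended to behave the same under ILP32 and LP64.

```c
#include <stdint.h>
#include <stddef.h>

/* Pointer <-> integer casts: uintptr_t is wide enough to hold a pointer
 * in both ILP32 and LP64, whereas int or unsigned int would truncate a
 * 64-bit pointer. */
uintptr_t ptr_to_int(const void *p) { return (uintptr_t)p; }
const void *int_to_ptr(uintptr_t v) { return (const void *)v; }

/* Bit-wise manipulation: make the constant 64 bits wide before shifting.
 * Writing (1 << bit) shifts a 32-bit int, which is undefined for
 * bit >= 32. */
uint64_t set_bit64(uint64_t word, unsigned bit)
{
    return word | (UINT64_C(1) << bit);
}

/* Magic numbers: derive sizes from the type instead of hard-coding 4,
 * because sizeof(long) is 4 under ILP32 and 8 under LP64. */
size_t bytes_for_longs(size_t count) { return count * sizeof(long); }
```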
Toolchains

■ ARM is actively involved in two major open source compilers

■ LLVM
  ■ AArch64 supported upstream as of LLVM 3.3
  ■ Ongoing work to address outstanding defects
  ■ OpenCL™ support
  ■ Buildbots available: http://lab.llvm.org:8011/builders/

■ GCC
  ■ AArch64 supported upstream as of GCC 4.8
  ■ Support for dynamic linking, TLS, cross-compiler and glibc
  ■ Support for C/C++ ABI and PCS
  ■ ARM® NEON™ auto-vectorization and intrinsics
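
When one source tree is built for both AArch32 and AArch64 with these toolchains, the compiler's predefined macros can be used to check what a given build targets. The sketch below relies on `__aarch64__`, `__arm__`, `__LP64__` and `__ARM_NEON`, which GCC and Clang predefine on the relevant targets. The function names are invented for this example.

```c
/* Hypothetical helper: describe the execution state and data model this
 * translation unit was compiled for. */
const char *target_description(void)
{
#if defined(__aarch64__) && defined(__LP64__)
    return "AArch64, LP64";
#elif defined(__aarch64__)
    return "AArch64 (not LP64)";
#elif defined(__arm__)
    return "AArch32, ILP32";
#else
    return "not an ARM target";
#endif
}

/* Hypothetical helper: 1 if NEON (Advanced SIMD) is available to this
 * build, 0 otherwise. */
int built_with_neon(void)
{
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    return 1;
#else
    return 0;
#endif
}
```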
Questions?

Thank You!


